Comparison of clustering and phenotyping approaches for subclassification of type 2 diabetes and its association with remission in Indian population

Identification of novel subgroups of type 2 diabetes (T2D) has helped improve its management. Most classification techniques focus on clustering or subphenotyping but not on both. This study aimed to compare both these methods and examine the rate of T2D remission in these subgroups in the Indian population. K-means clustering (using age at onset, HbA1C, BMI, HOMA2 IR and HOMA2%B) and subphenotyping (using homeostatic model assessment (HOMA) estimates) were performed on the baseline data of 281 patients with recently diagnosed T2D who participated in a 1-year online diabetes management program. Cluster analysis revealed three distinct clusters: severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), and mild obesity-related diabetes (MOD), while subphenotyping showed four distinct categories: hyperinsulinemic, insulinopenic, classical, and nascent T2D. Comparison of the two approaches revealed that the clusters aligned with phenotypes based on shared characteristics of insulin sensitivity (IS) and beta cell function (BCF). Clustering correctly identified individuals in the nascent group (high IS and BCF) as having mild obesity-related diabetes, which subphenotyping did not. Post-one-year intervention, higher remission rates were observed in the MOD cluster (p = 0.061) and the nascent phenotype showing high IS and BCF (p = 0.383, Chi-Square test). In conclusion, clustering based on a comprehensive set of parameters appears to be a superior method for classifying T2D compared with pathophysiological subphenotyping. Personalized interventions may be highly effective for newly diagnosed individuals with high IS and BCF and may result in higher remission rates in these individuals. Further large-scale studies are required to validate these findings.

Approximately 537 million adults around the world suffer from diabetes, with over 90% of these individuals having type 2 diabetes (T2D) 1. A deeper understanding of the aetiopathogenesis of T2D may help arrest the progression of the disease and prevent diabetes-related complications. It is crucial to emphasize the clinical significance of identifying individuals at higher risk of developing diabetes complications, as this can provide valuable insights into the underlying pathological abnormalities and guide the selection of appropriate treatments 2. By subclassifying T2D, healthcare providers may better manage the condition and offer more personalized treatment options for patients [2][3][4].
Subclassifications based on phenotypes and genotypes have been reported previously. Studies in the Swedish, Danish, and Indian populations have classified T2D using either clustering or subphenotyping [3][4][5]. The clustering approach has resulted in the identification of several subgroups, including severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild obesity-related diabetes (MOD), mild age-related diabetes (MARD), and one novel cluster reported in the Indian population, combined insulin-resistant and insulin-deficient diabetes (CIRDD) 3. The second method, subphenotyping, uses homeostatic model assessment (HOMA) for insulin sensitivity (IS) and beta cell function (BCF) because T2D is primarily characterised by beta cell dysfunction, insulin resistance, or a combination of both 6,7. Consequently, subphenotype classification based on the factors determining insulin kinetics holds greater potential for delivering individualised treatment to T2D patients with improved efficacy. In the Danish population, a phenotypic classification system based solely on BCF and IS has identified three predominant subtypes: hyperinsulinemic, insulinopenic, and classical T2D 5.
The importance of subclassifying diabetes has been highlighted in previous studies. For instance, the SIDD cluster has the highest risk of early retinopathy and neuropathy. Compared to patients in the MARD cluster, those in the SIRD cluster were five times more likely to develop chronic kidney disease (CKD) and end-stage renal disease (ESRD) 4. Thus, subclassification may offer an opportunity for individualised treatment and better management of T2D.
Furthermore, T2D remission has become a crucial aspect of disease management 8. The Association of British Clinical Diabetologists (ABCD) and the Primary Care Diabetes Society (PCDS) acknowledge the possibility of achieving T2D remission, provided beta cells function properly 9. A comprehensive subclassification offers a deeper understanding of BCF and insulin sensitivity in individual patients as well as their potential for achieving T2D remission. Previous studies have reported the risk factors associated with these subgroups. Currently, no data are available on T2D remission based on the subclassification of T2D. Furthermore, most existing subclassifications have predominantly focused on either clustering or pathophysiological phenotyping but rarely on both. Therefore, our primary aim was to compare the two classification systems and examine the rate of T2D remission in these subgroups in the Indian population.

Study cohort
The current study was based on data from the Freedom from Diabetes clinic, which runs an online diabetes management program. Clustering and subphenotyping were performed on retrospectively extracted baseline data, and remission was reported after one year of intensive lifestyle intervention. The eligibility criteria were recent diagnosis of T2D (based on WHO criteria 10), less than two years since diagnosis, availability of data on parameters used for classification (body mass index (BMI), glycated haemoglobin (HbA1C), plasma glucose levels, and fasting C-peptide), and data on endline HbA1C to define remission post-one-year intervention.
The patient flow is described in Fig. 1. Based on the availability of data on biochemical parameters for classification, a total of 281 T2D patients were included in the final analysis for subphenotyping and cluster analysis. All biochemical tests were performed at laboratories accredited by the National Accreditation Board for Testing and Calibration Laboratories (Government of India), and the reports were submitted by the patients during each consultation. The study adhered to ethical guidelines and all procedures involving human participants were conducted in accordance with the Declaration of Helsinki. Approval for this study was obtained from the institutional ethics committee of Dr. D. Y. Patil Vidyapeeth, Pune (DYPV/EC/138/16).

Subphenotyping
Homeostatic model assessment (HOMA) was used to categorise patients into different subphenotypes using methods described previously 5. We used version 2 of the revised homeostatic model assessment (HOMA2 calculator) to estimate insulin sensitivity (HOMA2 IS) and beta cell function (HOMA2%B) based on fasting C-peptide and fasting plasma glucose values 11. High and low values of IS and BCF were defined based on the median values of HOMA2 IS (53.9) and HOMA2%B (50) in a nondiabetic Indian population 12,13. For each patient, HOMA2%B (Y-axis) was plotted against HOMA2 IS (X-axis) and, using the cut-offs for HOMA2 IS and HOMA2%B, the patients were categorised into four subphenotypes (Fig. 2).

Cluster analysis
We applied the k-means clustering method using the k-means function (max iteration = 1000) and the same five variables (age at onset, BMI, HbA1C, HOMA2 IR and HOMA2%B) as reported previously 2. The optimal number of clusters was evaluated using the silhouette method, which is considered the most objective method 2,3,7,14 (Supplementary Figure S1). To test the robustness of the average silhouette method, we applied an alternative: the elbow method 3,7 (Supplementary Figure S2). The decision tree model was used to evaluate the effectiveness of the clustering methodology by training it on the dataset, with the variables used for clustering as features and the cluster labels as the target variable. The reported accuracy of 93% reflects the ability of the model to predict the assigned cluster labels based on the input features, serving as an internal validation of the robustness of the clustering approach for accurately categorising the data points into their respective clusters. As this study employed unsupervised clustering, true labels for comparison were not available; thus, the decision tree model was utilised for internal validation rather than comparison against external true labels. The cluster-forming tendency of the data was validated using the Hopkins statistic 3,7. The Hopkins statistic value of 0.2 suggests that there was a moderate clustering tendency in the dataset. Cluster-wise stability was computed using the Jaccard bootstrap method by resampling the dataset 2000 times; a stable cluster should yield a Jaccard similarity index greater than 0.75 3,7,15 (Supplementary Table S1). Cluster analysis was performed on the scaled and centred values. Cluster labels were assigned based on the phenotypic characteristics of individual cluster mean values of variables from previously published studies 2,3.
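The scaling and clustering steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in the study: the source does not name the k-means implementation or its initialisation, so the farthest-first seeding, the function names `zscore_columns` and `kmeans`, and the toy patient values below are all hypothetical.

```python
import math

def zscore_columns(rows):
    """Centre and scale each column to mean 0 and unit (population) SD.
    Assumes no column is constant (a zero SD would divide by zero)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def kmeans(points, k, max_iter=1000):
    """Plain Lloyd's algorithm; returns one cluster index per point.
    Seeding is farthest-first (an assumption, chosen to be deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented toy rows: age at onset, BMI, HbA1C (%), HOMA2 IR, HOMA2%B
toy_rows = [
    [35, 24.0, 9.5, 1.0, 30.0],
    [38, 23.5, 10.0, 1.1, 28.0],
    [36, 25.0, 9.8, 0.9, 32.0],
    [50, 31.0, 6.8, 2.5, 95.0],
    [52, 32.0, 7.0, 2.7, 100.0],
    [49, 30.5, 6.9, 2.4, 90.0],
]
toy_labels = kmeans(zscore_columns(toy_rows), k=2)
```

Scaling first matters because the five variables are on very different scales; without it, HOMA2%B and age would dominate the Euclidean distances.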

Replication of Swedish clusters
Furthermore, we attempted to replicate the clusters identified in the Swedish population since we used the same variables used by them (age at onset, HbA1C, BMI, HOMA2 IR, and HOMA2%B) 2. We used k-means clustering with the number of clusters forced to k = 4.
Fig. 2. Plot of insulin sensitivity and beta-cell function. The lines mark the distinction between the four phenotypes: hyperinsulinemic, insulinopenic, classical, and nascent type 2 diabetes; the colours mark the four subphenotypes identified in the main analysis; the values used as cutoff are from Indian non-diabetic population 11,12.

Patient flow across the clusters and subphenotypes
We used a Sankey diagram to compare patient flow between the three identified clusters and four phenotypes.

Intervention
The 1-year online intensive lifestyle intervention comprised four integrated protocols: diet, exercise, psychological support, and medical management. The details of the protocol have been described previously 16. Diet modifications included a customized plant-based diet, intermittent fasting for weight loss, and increased protein intake for muscle strengthening, which was introduced phase-wise. The exercise protocol was focused on increasing and maintaining strength, flexibility, and stamina. Psychological support included group therapy focused on relieving stress and anxiety and improving the overall mental health of patients. Medical management included daily monitoring and drug dose adjustments by a physician through a dedicated mobile application, along with supplementation for micronutrient deficiencies. The primary mode of delivery of the intervention was online through video meetings, conferences, and group sessions.
Patients were advised to seek individual medical consultation once every three months to monitor their progress. Anthropometric and biochemical parameters were collected during the follow-up visits. Regular monitoring with monthly follow-up calls, 12 live monthly group sessions, and recorded exercise and recipe videos was performed to encourage adherence to the protocol. The total program duration was one year. Post-intervention remission was defined as maintaining an HbA1C level < 48 mmol/mol for at least 3 months without the use of any glucose-lowering medications 17.
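The remission definition above can be written as a simple rule. The sketch below is illustrative only; the function and parameter names are hypothetical, and how the three-month window was actually measured in the program is not specified in the text. The unit conversion uses the standard IFCC-to-NGSP master equation, under which 48 mmol/mol corresponds to roughly 6.5%.

```python
HBA1C_THRESHOLD_MMOL_MOL = 48.0  # remission threshold stated in the text
MIN_MONTHS_SUSTAINED = 3         # minimum duration stated in the text

def ifcc_to_ngsp_percent(hba1c_mmol_mol):
    """Convert HbA1C from IFCC units (mmol/mol) to NGSP units (%)."""
    return 0.09148 * hba1c_mmol_mol + 2.152

def is_in_remission(hba1c_mmol_mol, months_sustained, on_glucose_lowering_medication):
    """True if HbA1C stayed below the threshold for long enough without medication.

    months_sustained: hypothetical input, the number of months the patient
    has maintained this HbA1C level while off glucose-lowering medication.
    """
    return (
        hba1c_mmol_mol < HBA1C_THRESHOLD_MMOL_MOL
        and months_sustained >= MIN_MONTHS_SUSTAINED
        and not on_glucose_lowering_medication
    )
```

Note that the threshold is strict: a value of exactly 48 mmol/mol does not count as remission under this reading of the definition.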

Statistical analyses
Statistical analyses were performed using IBM SPSS ver. 21 and Python (v3.8). Within each group, the median (interquartile range) or mean ± standard deviation was reported based on data distribution. The significance of the difference between the group means for age at onset, BMI, fasting blood glucose, lipid profile, C-peptide, HOMA2 IR, HOMA2%B, and HbA1C was tested using the Kruskal-Wallis test (between more than two groups) and Mann-Whitney U test (between two groups). The Chi-Square test was used to examine the associations between categorical variables. Statistical significance was set at p-value < 0.05.
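As an illustration of the Chi-Square test of association used for categorical variables such as remission status, the following standard-library sketch computes the Pearson statistic, the degrees of freedom, and the p-value for a contingency table. It is not the SPSS or Python routine used in the study; the function names are hypothetical and the example counts are invented.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution for integer dof,
    using the closed-form series for even and odd degrees of freedom."""
    half = x / 2.0
    if dof % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def chi_square_independence(table):
    """Pearson chi-square test for a rows-by-columns table of counts.
    Assumes every row and column total is non-zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, chi2_sf(stat, dof)

# Invented counts: rows are two groups, columns are (remission, no remission)
stat, dof, p_value = chi_square_independence([[10, 20], [20, 10]])
```

For a table with three clusters and two outcomes the degrees of freedom are 2; with four subphenotypes they are 3.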

Results
At the time of enrolment, 65.6% of patients were on glucose-lowering medication, 27.0% were drug-naïve, and 7.4% used insulin in combination with glucose-lowering medication (oral hypoglycaemic agents). The median time from diagnosis to enrolment was 449 days [IQR 157-741 days]. Among the 281 newly diagnosed T2D patients eligible for further phenotyping and clustering, the mean age was 42.3 ± 11.3 years, and 59.4% were male.

Subphenotyping
Figure 2 shows four distinct subphenotypes classified based on median values for IS and BCF in the non-diabetic Indian population as cutoffs 12,13. Group 1 (lower right) was characterised by normal to high IS but severely reduced BCF (insulinopenic T2D) and accounted for 17.0% of the patients. Group 2 was characterised by low IS and reduced BCF (classical T2D), accounting for 14.6% of the patients. Group 3 was characterised by low IS but normal to high BCF (hyperinsulinemic T2D) and accounted for 40.9% of the patients. The fourth group was characterised by both high IS and high BCF (nascent T2D) and accounted for 27.4% of the patients.
Table 1 shows a comparison of anthropometric, biochemical, and medical parameters among the four subphenotypes. Compared with other subphenotypes, hyperinsulinemic patients showed the highest BMI (median 29.0 kg/m²) (Kruskal-Wallis test, p < 0.001), statin medication use (Chi-Square test, p = 0.800), antihypertensive medication use (Chi-Square test, p = 0.012), and history of heart disease, with a 30.4% remission rate (Chi-Square test, p = 0.383). Among all other categories, the classical T2D group showed a higher number of patients on insulin in combination with glucose-lowering medication, with significantly higher HbA1C (Kruskal-Wallis test, p = 0.001), abnormal lipid profile (Kruskal-Wallis test, p < 0.05), and the second highest use of statins (Chi-Square test, p = 0.800). The insulinopenic group showed the lowest use of statins and the highest prevalence of substance use with a 35.4% remission rate, although the difference was not statistically significant (Chi-Square test, p = 0.383).
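The four-quadrant assignment described above reduces to two threshold comparisons. The sketch below uses the cut-offs stated in the Methods (HOMA2 IS 53.9 and HOMA2%B 50); the function name is hypothetical, and treating a value exactly at a cut-off as "high" is an assumption, since the text does not say how ties were handled.

```python
HOMA2_IS_CUTOFF = 53.9  # median HOMA2 IS in a nondiabetic Indian population (from the text)
HOMA2_B_CUTOFF = 50.0   # median HOMA2%B in a nondiabetic Indian population (from the text)

def classify_subphenotype(homa2_is, homa2_b):
    """Assign one of the four subphenotypes from HOMA2 IS and HOMA2%B.

    Values equal to a cut-off are treated as high (an assumption).
    """
    high_is = homa2_is >= HOMA2_IS_CUTOFF
    high_bcf = homa2_b >= HOMA2_B_CUTOFF
    if high_is and high_bcf:
        return "nascent"           # high IS, high BCF
    if high_is:
        return "insulinopenic"     # high IS, low BCF
    if high_bcf:
        return "hyperinsulinemic"  # low IS, high BCF
    return "classical"             # low IS, low BCF
```

For example, a patient with HOMA2 IS of 80 and HOMA2%B of 30 falls in the lower-right quadrant and is labelled insulinopenic.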

Cluster analysis
Post cluster analysis, based on the silhouette score of 0.4, we considered a k value of 3 to be the optimal number of clusters with the following distributions: severe insulin-deficient diabetes (SIDD) (34.5%), severe insulin-resistant diabetes (SIRD) (13.5%), and mild obesity-related diabetes (MOD) (52%) (Table 2).
The characteristics of the three clusters were examined (Table 2). The SIDD cluster was characterised by higher HbA1C and fasting blood glucose levels, and lower BMI, HOMA2 IR, HOMA2%B, and C-peptide levels compared to both the SIRD and MOD groups, as tested by the Mann-Whitney U test (p < 0.05). This cluster also had significantly higher total cholesterol and LDL levels than the other clusters (Mann-Whitney U test, p < 0.001). The SIRD cluster was characterised by the highest BMI, C-peptide levels, HOMA2 IR, and HOMA2%B compared with the other groups (Mann-Whitney U test, p < 0.001). The MOD cluster was characterised by obesity but not insulin resistance (HOMA2 IR). The remission rates were highest in the MOD group, followed by the SIDD and SIRD groups, with marginal significance (p = 0.061, Chi-Square test).
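For readers unfamiliar with the silhouette score used to choose k, the sketch below computes the mean silhouette width for a given labelling with Euclidean distances. It is a minimal standard-library illustration, not the routine used in the study; the function name and the toy points are hypothetical. Scores near 1 indicate compact, well-separated clusters, and scores near 0 indicate overlapping ones.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette width; requires at least two distinct clusters
    and at least two non-identical points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Invented toy points forming two tight groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])
```

In practice the score is computed for each candidate k and the k with the highest mean silhouette is retained, which is how k = 3 (0.4) was preferred over k = 4 (0.35) here.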

Replication of Swedish clusters
Forced clustering with k = 4 (Silhouette score 0.35) was performed to replicate the original Swedish clusters since we used the same parameters as used by them. The analysis showed that of the four original clusters, we could replicate three in our cohort (SIDD, SIRD, and MOD), while the fourth cluster, MARD, was not identified. Instead, we identified a unique cluster, combined insulin-resistant and insulin-deficient diabetes (CIRDD), previously reported in the Indian population 3. Higher remission rates were observed in this cluster, followed by the MOD, SIRD, and SIDD clusters, although the difference was not statistically significant (Chi-Square test, p = 0.367) (Supplementary Table S3). Furthermore, this cluster merged with the MOD and SIRD clusters when optimal k-means clustering was performed (Supplementary Figure S3).

Comparison of clusters with subphenotypes
Considering the higher Silhouette score, we considered k = 3 clusters for comparison with the subphenotype classification. Since both methods used HOMA estimates as one of the components, we wanted to understand if there was any overlap between the two classifications. Initially, we used a Sankey diagram to assess the flow of patients between the two classifications. The analysis revealed that clusters merged with their respective phenotypes based on shared characteristics such as insulin sensitivity and beta cell function, clearly illustrating these associations. We observed that most of the SIRD cluster (92.1%) exhibited a hyperinsulinemic phenotype, which is characterised by low insulin sensitivity (i.e. insulin resistance) and high beta cell function, similar to the SIRD cluster. In contrast, 45.4% and 42.3% of the patients in the SIDD cluster exhibited insulinopenic and classical phenotypes, respectively. Both phenotypes showed low beta cell function, similar to the SIDD cluster. Furthermore, 45.9% of the MOD cluster exhibited a nascent phenotype (Fig. 3). The nascent phenotype is characterised by high beta cell function and no insulin resistance (high insulin sensitivity), similar to the MOD cluster, which explains the patients' flow to the cluster. In terms of remission, higher rates were observed in the MOD cluster, similar to the nascent phenotype. Lower remission rates were observed in the SIDD and SIRD clusters, similar to the hyperinsulinemic, insulinopenic, and classical phenotypes.
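The percentages quoted above are the flows a Sankey diagram draws: for each cluster, the share of its patients that falls into each phenotype. A minimal sketch of that cross-tabulation follows; the function name is hypothetical and the labels in the example are invented, not study data.

```python
from collections import Counter

def flow_percentages(cluster_labels, phenotype_labels):
    """Percentage of each cluster's patients that falls in each phenotype.

    Returns a dict keyed by (cluster, phenotype); pairs with no patients
    are simply absent.
    """
    pair_counts = Counter(zip(cluster_labels, phenotype_labels))
    cluster_sizes = Counter(cluster_labels)
    return {
        (cluster, phenotype): 100.0 * count / cluster_sizes[cluster]
        for (cluster, phenotype), count in pair_counts.items()
    }

# Invented labels for six hypothetical patients
toy_clusters = ["SIRD", "SIRD", "MOD", "MOD", "MOD", "MOD"]
toy_phenotypes = ["hyperinsulinemic", "hyperinsulinemic",
                  "nascent", "nascent", "hyperinsulinemic", "classical"]
toy_flows = flow_percentages(toy_clusters, toy_phenotypes)
```

Normalising by cluster size rather than by the whole cohort is what makes statements such as "92.1% of the SIRD cluster" comparable across clusters of very different sizes.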

Discussion
For the first time, pathophysiological subphenotyping was performed in an Indian population. Among 281 individuals with recently diagnosed T2D, we identified four subphenotypes (hyperinsulinemic, insulinopenic, classical, and nascent) that differed in their characteristics and rates of T2D remission. Three of these categories correspond to the subphenotypes identified in the Danish population 5; additionally, we identified a category with high BCF and high IS (nascent). This could be due to the relatively lower HOMA2%B cutoff for the Indian population, which is almost half that of the Danish population. The nascent category was excluded from the Danish study 5; however, we chose to retain the group. As this category includes patients with both high insulin sensitivity and good beta cell function, they are likely to be in the initial stages of developing diabetes. The term "nascent" describes the early stages of diabetes development when there is both high IS and high BCF. This implies that the condition is just beginning to emerge and develop, and early intervention could potentially reverse or manage diabetes progression.
We were also able to identify three T2D clusters in the Indian population by applying data-driven cluster analysis using the key variables of age at onset, BMI, HbA1C, HOMA2 IR, and HOMA2%B used previously 2. Post cluster analysis, the distribution of variables used for clustering revealed a pattern similar to that observed in the Swedish study 2,3,7. The three clusters identified in this study are SIDD, SIRD, and MOD. Contrary to previous findings, the MOD cluster contained the largest share of participants in our study cohort 2,3,18. Furthermore, the SIRD cluster characteristics resembled those of the insulin-resistant obese diabetes (IROD) cluster identified previously in the Indian population 3. The observation in a previous study in the Swedish population that the risk of NAFLD (nonalcoholic fatty liver disease) is greater in SIRD patients than in MOD patients indicates that the severe insulin resistance observed in SIRD patients is not due to obesity alone 2. Therefore, we preferred to use the term SIRD rather than the more generalised morphological classification of obesity in IROD 2,4,17.
When we compared clusters with subphenotypes, clusters merged with their respective phenotypes based on shared clinical characteristics, clearly illustrating associations, contrary to a previous study in which the majority of the patients in the SIDD and MARD clusters relocated to the classical phenotype while the SIRD cluster merged with the hyperinsulinemic phenotype 7. The remission rates in the clusters and phenotypes were comparable between the groups. Notably, the SIDD and SIRD clusters exhibited lower remission rates, mirroring those observed in the hyperinsulinemic, insulinopenic, and classical subphenotypes. This could be attributed to the underlying characteristics of these groups at baseline: the SIDD cluster's low beta cell function, similar to the insulinopenic and classical phenotypes, and the insulin resistance of the SIRD cluster, similar to the hyperinsulinemic phenotype, both of which contribute to their reduced likelihood of remission. It is worth discussing the nascent subphenotype observed in our cohort. This subphenotype consisted mostly of patients from the MOD cluster and showed the lowest HbA1C and fasting blood glucose levels at baseline, similar to the MOD cluster, which explains the highest remission rates in both groups. Previous studies have excluded this group from further analysis 5,7. We chose to retain them because, despite showing high insulin sensitivity and high BCF, 74% of the patients were on either oral hypoglycaemic agents or insulin at the time of enrolment in the program and thus could not be labelled as nondiabetic. Therefore, in a clinical setting, excluding this category of patients merely based on HOMA2 IS and HOMA2%B values would be inappropriate. Thus, even though subphenotyping offers a quick and convenient method of classifying patients in a clinical setting, special attention must be paid to those showing high IS and high BCF, who have higher chances of achieving remission.
The major strength of this study is that it reports both clustering and subphenotypic classification in the Indian population. This is the first study to report and compare T2D remission across phenotypes and clusters in the same population. One major limitation of this study is the relatively small sample size compared to other large-scale population-based studies, meaning that the findings of the study may not be applicable to the entire Indian population. Furthermore, data on additional body composition parameters such as muscle mass, waist circumference, and waist-to-hip ratio were not available, which may have otherwise helped strengthen the findings of the study and provide further insights into the multifactorial nature of T2D. The online mode of the intervention may also have introduced variability in the adherence and the overall effectiveness of the program through variables such as access and knowledge of technology and its use, differing engagement levels, social support, and personal limitations. To mitigate this, we employed various methods such as providing technology support for navigating the application, frequent reminders and scheduled monthly calls, feedback forms, monthly live sessions to increase engagement, personalised coaching through dedicated experts and customised interventions, and recognising and publishing success stories to encourage the participants. Furthermore, the intervention was not customised based on the subclassifications described since the classifications were performed retrospectively. Despite these limitations, our findings are clinically relevant, especially with respect to identifying the differing rates and possibility of T2D remission based on the subclassification of T2D.
In conclusion, this research highlights the importance of subclassification in the management of diabetes, particularly in relation to T2D remission. It further suggests that targeted personalised interventions may be particularly effective for individuals who are newly diagnosed with high insulin sensitivity and good beta cell function and may result in higher chances of remission. Future large-scale studies that explore the potential impact of lifestyle interventions based on T2D subclassification may provide valuable insights for the development of effective treatment plans for the management of diabetes and its associated complications.

Table 2. Comparison of anthropometric, biochemical, and medical characteristics in the three optimal k-value clusters. Parameters in bold are used for clustering; data for all parameters are presented as mean ± standard deviation or median (interquartile range) or frequency (%); BMI, body mass index; HbA1C, glycated haemoglobin; HOMA2 IR, homeostatic model assessment of insulin resistance; HOMA2%B, homeostatic model assessment of beta-cell function; HDL, high-density lipoprotein; LDL, low-density lipoprotein; OHAs, oral hypoglycaemic agents. *Smoking or alcohol or tobacco or a combination of any two; Kruskal-Wallis test was used to compare the difference in means between more than two groups; significance of difference between categorical variables was tested using the Chi-Square test; Mann-Whitney U test was used to test the significance of the difference in means between two groups. a, significantly different from SIDD; b, significantly different from SIRD.

Table 1. Comparison of anthropometric, biochemical, and medical characteristics in the four identified subphenotypes. Data for all parameters are presented as mean ± standard deviation or median (interquartile range) or frequency (%); BMI, body mass index; HbA1C, glycated haemoglobin; HDL, high-density lipoprotein; LDL, low-density lipoprotein; OHAs, oral hypoglycaemic agents. *Smoking or alcohol or tobacco or a combination of any two; Kruskal-Wallis test was used to compare the difference in means between more than two groups; significance of difference between categorical variables was tested using the Chi-Square test; Mann-Whitney U test was used to test the significance of the difference in means between two groups. a, significantly different from hyperinsulinemic; b, significantly different from insulinopenic; c, significantly different from classical.